Assume that each item from a set of items is classified into one of 
several categories. We then use the proportion of items classified into 
a category to estimate the true proportion of items from the category 
in the population. This article models the effect of misclassification 
error on the estimate of the true proportion. We discuss two conditions 
which can be used to determine the adequacy of a classifier. We 
present an optimal classification algorithm which can be used when 
the joint distribution of the variables on which classifications are 
based is known separately for items from each category. 

I. INTRODUCTION 

Let us suppose that we observe a set of items which can be split into 
several distinct categories. Each item is measured and classified by 
some device into one of these various categories. However, the classified
category for an item and the true category may not be the same, 
i.e., the device may make a misclassification error. The observed 
proportion of items in a category is then used to estimate the true 
proportion. 

The preceding scenario often occurs in quality control [1] and medical 
research [2,3]. In quality control, individual manufactured items from a 
sample or lot are often classified by a mechanical device as defective 
or not, and the proportion of defectives in the sample is then used to 
estimate the proportion of defectives from the entire process. In 
medical research, the items are people and the idea is to estimate the 
proportion of people with various diseases. In a Bell System example, 
the items would be phone calls and the categories would be busies, 
completed calls, reorders, etc. An automated device would attempt to 
determine the true category for each call. The output of the device 
would then be the estimated proportion of calls in each category. This 
last example motivated the analysis contained in this article [4]. Note 
that the problems discussed here are quite different from the traditional 
classification problem [5], where the goal is to maximize the 
probability of correctly classifying each item. In all the above examples, 
errors from misclassification can have a serious effect. 

In this article, we attack two separate problems. In the first problem, 
we assume we have little or no control over the internal design of the 
classifier; all that we have is an estimate of the probability of classifying 
items from each category into each of the other categories. The object 
is to develop a simple way to specify how good the classifier must be. 
Also, we should indicate to the designer the direction in which 
improvements are necessary. The second problem handles the case when 
we do have control over the design of the classifier. In this case, we 
assume that the object is to design the classifier so that the effects of 
misclassification error are minimized. In this article, we are concerned 
only with a classifier's ability to estimate proportions, i.e., our loss 
function is entirely different from the usual loss function. 

This article is organized in the following way. Section II introduces 
notation and explains the effects of misclassification error on the 
estimated proportion of items in a category. Section III discusses the 
case when we have little or no control over the design of the classifier. 
In Section IV, we discuss the case when we have control over the 
classifier. The resulting minimization problem involves a function that 
is quadratic in the probabilities of classifying items. 



II. EFFECTS OF MISCLASSIFICATION 

Let $m$ be the number of categories. All vectors in this paper are 
$m$-dimensional column vectors, and matrices are $m$ by $m$ matrices. Also 
let $p$, $p^*$, and $\hat{p}$ be the $m$-dimensional vectors of the true probabilities 
of items from the various categories, the probabilities of classifying 
items into the various categories, and the observed proportions of items 
actually classified into the different categories, respectively. Let $A = (a_{ij})$ 
be the $m$ by $m$ misclassification matrix which contains the conditional 
probabilities of classifying items into different categories. For example, 
$a_{12}$ would be the probability of classifying an item from category 2 as 
an item from category 1. Each column of $A$ sums to one. The diagonal 
elements of $A$ are the probabilities of correct classification, and the 
off-diagonal elements give the probabilities of misclassification. A 
perfect classifier would have $A = I$, the identity matrix. By the law of 
total probability, $p^* = Ap$. 
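
As a small numerical illustration of $p^* = Ap$, the following sketch applies a hypothetical three-category misclassification matrix to an assumed vector of true proportions. The function name `classified_probs` and all numbers are illustrative assumptions, not values from the article.

```python
def classified_probs(A, p):
    """Return p* = A p, where A[i][j] = P(classified as i | true category j)."""
    return [sum(a_ij * p_j for a_ij, p_j in zip(row, p)) for row in A]

# Hypothetical 3-category classifier (illustrative numbers); each column sums to one.
A = [[0.90, 0.05, 0.10],
     [0.05, 0.90, 0.10],
     [0.05, 0.05, 0.80]]
p = [0.5, 0.3, 0.2]  # assumed true proportions

p_star = classified_probs(A, p)
print(p_star)  # approximately [0.485, 0.315, 0.2]
```

Even though every diagonal element is at least 0.8, the first two classified proportions differ from the true ones, which is the bias studied below.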

In measuring the effectiveness of $\hat{p}$ as an estimator of $p$, we use the 
matrix of mean-squared errors. The matrix of mean-squared errors 
contains the mean-squared errors of the individual terms and also the 
cross-product terms, which indicate how the different errors are related. 
The $m$ by $m$ matrix of mean-squared errors (mmse) is 

$$E[(\hat{p} - p)(\hat{p} - p)' \mid p] = \operatorname{cov}(\hat{p} \mid p) + [E(\hat{p} \mid p) - p][E(\hat{p} \mid p) - p]', \tag{1}$$

where "cov" means the covariance matrix. The diagonal of the covariance 
matrix measures the precision of an estimator, while the diagonal of 
the second term in (1) is the squared bias. 

We assume that $n$ items are to be classified. The expected number 
of items from each category is $np$. If items are classified independently, 
the distribution of $n\hat{p}$ will be a multinomial distribution with parameters 
$Ap$ and sample size $n$. From Ref. 6, we obtain $\operatorname{cov}(\hat{p} \mid p)$, 

$$\operatorname{cov}(\hat{p} \mid p) = [D^* - App'A']/n, \tag{2}$$

where $D^*$ is a diagonal matrix with diagonal equal to $p^*$. The $\operatorname{cov}(\hat{p} \mid p)$ 
can be separated into two parts, one part which is the covariance 
matrix if the classifier were perfect, and the second part which is an 
adjustment in the covariance matrix because the classifier is not 
perfect: 

$$\operatorname{cov}(\hat{p} \mid p) = [D - pp']/n + [D^* - D - App'A' + pp']/n,$$

where $D$ is a diagonal matrix with diagonal equal to $p$. 
Since $E(\hat{p} \mid p) = Ap$, the bias term in (1) becomes 

$$[E(\hat{p} \mid p) - p][E(\hat{p} \mid p) - p]' = (A - I)pp'(A - I)'. \tag{3}$$

Note that this term is not divided by $n$. Putting the above statements 
together, (1) becomes 

$$E[(\hat{p} - p)(\hat{p} - p)' \mid p] = (D - pp')/n + (D^* - D - App'A' + pp')/n + (A - I)pp'(A - I)'. \tag{4}$$

This equation includes the sampling-error effect (first term on the right) 
and the effect of misclassification error (the other terms on the right). 
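
To make the relative size of the terms in (4) concrete, here is a minimal sketch that evaluates the matrix of mean-squared errors for a two-category case. The function names and the numerical values of $A$, $p$, and $n$ are assumptions chosen for illustration only.

```python
def matvec(A, p):
    """Matrix-vector product A p for list-of-lists A."""
    return [sum(a * x for a, x in zip(row, p)) for row in A]

def mse_matrix(A, p, n):
    """Matrix of mean-squared errors (4) for true proportions p and sample size n."""
    m = len(p)
    p_star = matvec(A, p)                          # p* = A p
    bias = [p_star[i] - p[i] for i in range(m)]    # (A - I) p
    mse = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            d_star = p_star[i] if i == j else 0.0
            cov = (d_star - p_star[i] * p_star[j]) / n   # (2): [D* - A p p' A'] / n
            mse[i][j] = cov + bias[i] * bias[j]          # add the bias term (3)
    return mse

# Illustrative (assumed) two-category classifier and true proportions.
A = [[0.95, 0.10],
     [0.05, 0.90]]
p = [0.2, 0.8]
n = 1000

mse = mse_matrix(A, p, n)
perfect = mse_matrix([[1.0, 0.0], [0.0, 1.0]], p, n)
print(mse[0][0], perfect[0][0])  # about 0.0051 versus 0.00016
```

In this example the squared bias (0.07 squared) dominates the sampling variance, and unlike the variance it does not shrink as $n$ grows.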

III. SPECIFYING ACCURACY WITH LITTLE CONTROL OVER THE 
CLASSIFIER 

Assume that our information on a classifier is confined to the matrix 
A, i.e., someone else is responsible for the design of the classifier. We 
may influence the form of A but we have no control over the actual 
functioning of the classifier. Suppose we have a good idea of how large 
the mean-squared errors for each category estimate [the diagonal 
elements of (4)] can be for the application of the classifier. We need 
several guidelines on the form of $A$ which will ensure that the classifier 
and its resulting mean-squared errors are adequate. Clearly, we do not 
want to put constraints on every element of $A$. Also, we cannot tell the 
designer of the classifier that the mean-squared errors of the classifier 
are inadequate without providing some guidance on how to improve 
the classifier. 

Intuitively, we would like to pick $A$ so that the bias term in (4) 
disappears. We cannot pick $A$ so that the bias term in (4) disappears 
for every $p$. For each $A$, however, there is some value of $p$ for which 
the bias term disappears; in fact, $A$ tends to produce $\hat{p}$ which are 
collapsed toward the value of $p$ for which the bias term disappears. A 
reasonable strategy is to pick $A$ so that the bias term disappears for 
our "best guess" for $p$, which we denote by $p_0$. This means $A$ should be 
picked so that $(A - I)p_0 \approx 0$, i.e., $p_0$ is approximately an eigenvector 
of $A$ with eigenvalue 1. There are many $A$ which satisfy $(A - I)p_0 \approx 0$ 
but which have large mean-squared errors for values of $p$ near $p_0$. 
To ensure that mean-squared errors are small for values of $p$ near $p_0$, 
we must additionally require that the $a_{ii}$ (diagonal elements of $A$) be 
reasonably large; note that if the eigenvector condition is nearly 
satisfied, the requirement on the $a_{ii}$ may, in many cases, be quite loose. 

These two conditions, (1) $(A - I)p_0 \approx 0$ and (2) $a_{ii}$ large, are 
generally easy to check. Assume we have a set of $s$ items which we 
know contains $sp_0$ items from the respective categories. These items 
are classified and the results are used to estimate $A$. The first condition 
states that the number of items classified into each category is roughly 
equal to the corresponding element of $sp_0$. If this condition is not 
originally satisfied, the designer 
can usually satisfy it by adjusting several thresholds which determine 
where the classifier places items. The second condition says that the 
classifier cannot misclassify a high proportion of items from any one 
category. The designer is then told which categories do not satisfy this 
condition. 
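
The check described above can be sketched in a few lines. The code below estimates $A$ from a table of test counts and reports both conditions. The function names, the tolerance `bias_tol`, the floor `min_diag`, and the counts are hypothetical; the article does not prescribe numerical thresholds.

```python
def estimate_A(counts):
    """counts[i][j] = number of test items from category j classified as i."""
    m = len(counts)
    col_totals = [sum(counts[i][j] for i in range(m)) for j in range(m)]
    return [[counts[i][j] / col_totals[j] for j in range(m)] for i in range(m)]

def check_conditions(A, p0, bias_tol=0.01, min_diag=0.85):
    """Return ((A - I) p0, condition-1 flag, categories failing condition 2)."""
    m = len(p0)
    drift = [sum(A[i][j] * p0[j] for j in range(m)) - p0[i] for i in range(m)]
    balanced = all(abs(d) <= bias_tol for d in drift)      # condition (1)
    weak = [i for i in range(m) if A[i][i] < min_diag]     # condition (2), 0-based
    return drift, balanced, weak

# Assumed test set of s = 1000 items with p0 = (0.7, 0.2, 0.1).
counts = [[665,  20, 15],
          [ 25, 170,  5],
          [ 10,  10, 80]]
p0 = [0.7, 0.2, 0.1]

A_hat = estimate_A(counts)
drift, balanced, weak = check_conditions(A_hat, p0)
print(balanced, weak)  # True [2]
```

Here the row totals match $sp_0$, so the first condition holds, while the third category (index 2) is correctly classified only 80 percent of the time and would be reported to the designer.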

We now show that these two conditions can be justified analytically 
when we assume that $p$ has an underlying Dirichlet distribution. That 
is, we now allow $p$ to vary, for example, with environment. The 
Dirichlet distribution is the natural multivariate generalization of the 
beta distribution and it is the conjugate prior for the multinomial 
distribution. We pick the parameters of the Dirichlet distribution so 
that 

$$E(p) = p_0, \tag{5}$$

$$\operatorname{cov}(p) = -p_0 p_0'/(\nu + 1) + D_0/(\nu + 1), \tag{6}$$

where $D_0$ is a diagonal matrix whose $i$th diagonal element is the $i$th 
element of $p_0$, and $\nu$ is a parameter that indicates how spread out the 
Dirichlet distribution is (by (6), a larger $\nu$ gives a more concentrated 
distribution). These parameters, $p_0$ and $\nu$, can be chosen so 
that the resulting Dirichlet distribution models the expected environmental 
variability of $p$. If we take expected values of the terms in (4) 
using (5) and (6), we obtain 

$$E(\hat{p} - p)(\hat{p} - p)' = D_0^*/n - AD_0A'/[n(\nu + 1)] - Ap_0p_0'A'\nu/[n(\nu + 1)] + (A - I)p_0p_0'(A - I)'\nu/(\nu + 1) + (A - I)D_0(A - I)'/(\nu + 1), \tag{7}$$

where $D_0^*$ is a diagonal matrix whose diagonal equals $Ap_0$. This 
equation incorporates the effects of varying $p$. If $n$ and $\nu$ are at all 
large, the fourth term in (7) will be important. If $(A - I)p_0 \approx 0$ holds, 
this term will drop out. The fifth term is minimized if the $a_{ii}$ are large. 
Since the second and third terms are generally unimportant [they are 
divided by $n(\nu + 1)$, which should be large] and since the first term is 
present even with a perfect classifier, satisfying the two conditions will 
minimize the effects of misclassification. In short, we have presented 
a way to require the $A$ matrix to be "near" the identity matrix without 
putting constraints on each and every element of $A$. 
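
The step from (4) to (7) relies only on the first two moments of $p$ in (5) and (6). The following worked lines are a sketch of that calculation in the notation above, writing $D_0^*$ for the diagonal matrix with diagonal $Ap_0$; they add no assumptions beyond (5) and (6).

```latex
% Second moment of p implied by (5) and (6):
E(pp') = \operatorname{cov}(p) + p_0 p_0'
       = \frac{D_0 + \nu\, p_0 p_0'}{\nu + 1},
\qquad
E(D^*) = D_0^* .

% Expected value of the covariance part of (4), i.e. of (2):
E\!\left[\frac{D^* - A p p' A'}{n}\right]
  = \frac{D_0^*}{n}
  - \frac{A D_0 A'}{n(\nu + 1)}
  - \frac{\nu\, A p_0 p_0' A'}{n(\nu + 1)} .

% Expected value of the bias part of (4), i.e. of (3):
E\!\left[(A - I)\, p p'\, (A - I)'\right]
  = \frac{\nu\,(A - I) p_0 p_0' (A - I)'}{\nu + 1}
  + \frac{(A - I) D_0 (A - I)'}{\nu + 1} .
```

Adding the last two displays gives the five terms of (7) in the order in which they are discussed in the text.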

IV. DESIGNING A CLASSIFIER THAT MINIMIZES MEAN-SQUARED 
ERROR 

Assume that our job is to develop a classifier that minimizes 
mean-squared error. More specifically, assume we measure 
a vector of variables ($x$) on each item. Regions $R_i$ are defined such 
that if $x \in R_i$, we classify the item into category $i$. We want to define 
the $R_i$ that minimize a weighted sum of the mean-squared errors for 
the various categories, i.e., that minimize 

$$\operatorname{tr}[Q\,E(\hat{p} - p)(\hat{p} - p)'], \tag{8}$$

where $Q$ is a known positive diagonal matrix, tr is the trace operator, 
and $E(\hat{p} - p)(\hat{p} - p)'$ is defined by (7). The matrix $Q$ is just a 
weighting factor that can be used to emphasize the important categories. 
Minimizing (8) is equivalent to minimizing 

$$\operatorname{tr}(QABA' - AC), \tag{9}$$



where 

$$B = \frac{n - 1}{n(\nu + 1)}\,(D_0 + p_0 p_0'\nu), \tag{10}$$

$$C = \left[\frac{2}{\nu + 1}\,(D_0 + p_0 p_0'\nu) - \frac{1}{n}\,p_0\,(1, 1, \ldots, 1)\right] Q. \tag{11}$$


We handle a more general case by allowing B to be any known 
symmetric, nonnegative definite matrix, and C any known matrix. 
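
The equivalence between (8) and (9) can be checked numerically: the weighted trace of (7) and the quadratic form $\operatorname{tr}(QABA' - AC)$ should differ only by a quantity that does not depend on $A$. The sketch below performs that check with plain Python lists. All function names and numerical values are illustrative assumptions, and $B$ and $C$ are taken in the form $B = \frac{n-1}{n(\nu+1)}(D_0 + \nu p_0p_0')$ and $C = [\frac{2}{\nu+1}(D_0 + \nu p_0p_0') - \frac{1}{n}p_0(1,\ldots,1)]Q$.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def diag(v):
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

def wtrace(q, X):
    """tr(Q X) for Q = diag(q)."""
    return sum(q[i] * X[i][i] for i in range(len(q)))

def B_and_C(p0, nu, n, q):
    m = len(p0)
    S = [[(p0[i] if i == j else 0.0) + nu * p0[i] * p0[j] for j in range(m)]
         for i in range(m)]                                  # D0 + nu * p0 p0'
    B = [[(n - 1) / (n * (nu + 1)) * S[i][j] for j in range(m)] for i in range(m)]
    C = [[(2.0 * S[i][j] / (nu + 1) - p0[i] / n) * q[j] for j in range(m)]
         for i in range(m)]
    return B, C

def objective(A, B, C, q):
    """tr(Q A B A' - A C), expression (9)."""
    ones = [1.0] * len(q)
    return wtrace(q, matmul(A, matmul(B, transpose(A)))) - wtrace(ones, matmul(A, C))

def weighted_mse(A, p0, nu, n, q):
    """tr[Q * (7)], evaluated term by term."""
    m = len(p0)
    D0 = diag(p0)
    P = [[p0[i] * p0[j] for j in range(m)] for i in range(m)]
    AmI = [[A[i][j] - (1.0 if i == j else 0.0) for j in range(m)] for i in range(m)]
    Ap0 = [sum(A[i][j] * p0[j] for j in range(m)) for i in range(m)]
    t1 = sum(q[i] * Ap0[i] for i in range(m)) / n
    t2 = -wtrace(q, matmul(A, matmul(D0, transpose(A)))) / (n * (nu + 1))
    t3 = -wtrace(q, matmul(A, matmul(P, transpose(A)))) * nu / (n * (nu + 1))
    t4 = wtrace(q, matmul(AmI, matmul(P, transpose(AmI)))) * nu / (nu + 1)
    t5 = wtrace(q, matmul(AmI, matmul(D0, transpose(AmI)))) / (nu + 1)
    return t1 + t2 + t3 + t4 + t5

p0, nu, n, q = [0.3, 0.7], 20.0, 500, [1.0, 2.0]   # assumed values
B, C = B_and_C(p0, nu, n, q)
A1 = [[0.90, 0.20], [0.10, 0.80]]
A2 = [[0.70, 0.05], [0.30, 0.95]]
d1 = weighted_mse(A1, p0, nu, n, q) - objective(A1, B, C, q)
d2 = weighted_mse(A2, p0, nu, n, q) - objective(A2, B, C, q)
print(d1, d2)  # the two differences agree up to rounding
```

The common difference is the part of (8) that does not involve $A$, namely $[\nu\,\operatorname{tr}(Qp_0p_0') + \operatorname{tr}(QD_0)]/(\nu + 1)$, which is why minimizing (9) over classifiers is the same as minimizing (8).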

Equation (9) is interesting because it is quadratic in $A$. The usual 
Bayes multiple decision rule [7] is a special case of (9), since it is the rule 
that minimizes (9) when $B = 0$ and $C$ is a diagonal matrix with 
diagonal equal to the vector of prior probabilities. Classical discriminant 
analysis [5] is a special case of the Bayes multiple decision rule for 
which the distribution of $x$ when the item comes from category $i$ is 
multivariate normal with mean $\mu_i$ and covariance matrix $\Sigma$. 

Let $f_i(x)$ be the density of $x$ if $x$ is measured on an item from 
category $i$. The optimal classification algorithm is given in the following 
theorem: 

Theorem 1: Equation (9) will be minimized if $x$ is classified into 
the $i$th category when the $i$th element of 

$$(f_1(x), f_2(x), \ldots, f_m(x))\,(2BA'Q - C) \tag{12}$$

is the smallest. 
(Proof. See appendix.) 
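
For a given $A$, the rule in Theorem 1 is a single matrix-vector product followed by an arg-min. The sketch below is one way to write it, assuming the density values $f_j(x)$ have already been evaluated at the observed $x$. The function name and the example numbers are hypothetical, and category indices are 0-based.

```python
def theorem1_category(f_values, A, B, C, q):
    """Index (0-based) of the category with the smallest element of (12)."""
    m = len(f_values)
    # E = 2 B A' Q - C with Q = diag(q):
    # E[j][i] = 2 q[i] * sum_k B[j][k] A[i][k] - C[j][i]
    E = [[2.0 * q[i] * sum(B[j][k] * A[i][k] for k in range(m)) - C[j][i]
          for i in range(m)] for j in range(m)]
    scores = [sum(f_values[j] * E[j][i] for j in range(m)) for i in range(m)]
    return min(range(m), key=lambda i: scores[i])

# Bayes special case: B = 0 and C = diag(prior probabilities) (assumed values).
priors = [0.6, 0.3, 0.1]
zero = [[0.0] * 3 for _ in range(3)]
C_bayes = [[priors[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

f_at_x = [0.2, 0.5, 0.1]   # assumed density values f_1(x), f_2(x), f_3(x)
choice = theorem1_category(f_at_x, identity, zero, C_bayes, [1.0, 1.0, 1.0])
print(choice)  # 1, the category with the largest prior-weighted density
```

With $B = 0$ the matrix $A$ drops out of the rule, which is why the Bayes case needs no iteration; for a general $B$ the matrix $A$ passed in must be the one produced by the rule itself.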

In general, applying Theorem 1 should be quite difficult, since $A$, 
which itself depends on the classification regions, has 
to be solved for in (12). We now discuss the two-category case and 
then specialize to the case when $x$ has a multivariate normal distribution. 
We give a simple iterative procedure to calculate the required 
quantities for this last case. 

For the two-category case, Theorem 1 reduces to the following 
corollary. 

Corollary 1: If there are only two categories of measurements, then 
eq. (9) will be minimized if $x$ is placed into category 1 if 

$$f_1(x)/f_2(x) > K, \tag{13}$$

where 

$$K = \frac{2b_{22}(q_{11} + q_{22})a_{12} - 2b_{21}(q_{11} + q_{22})a_{21} + 2(b_{21}q_{11} - b_{22}q_{22}) + (c_{22} - c_{21})}{2b_{11}(q_{11} + q_{22})a_{21} - 2b_{12}(q_{11} + q_{22})a_{12} + 2(b_{12}q_{22} - b_{11}q_{11}) + (c_{11} - c_{12})}, \tag{14}$$

and $b_{ij}$, $q_{ij}$, and $c_{ij}$ are elements of $B$, $Q$, and $C$, respectively. 

Let us now assume that, if $x$ is an observation from category 
$i$ ($i = 1$ or 2), it has a multivariate normal distribution with known 
mean vector $\mu_i$ and known covariance matrix $\Sigma$. Then (13) becomes 

$$\log(f_1(x)/f_2(x)) = x'\Sigma^{-1}(\mu_1 - \mu_2) - \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) > \log K.$$

As in classical discriminant analysis [5], the distribution of 
$\log(f_1(x)/f_2(x))$ is normal with mean $\alpha/2$ or $-\alpha/2$ when $x$ comes from 
category 1 or 2, respectively, where 

$$\alpha = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2). \tag{15}$$

In either case, the variance of $\log(f_1(x)/f_2(x))$ is $\alpha$. Therefore, when $x$ 
comes from category 1, 

$$[\log(f_1(x)/f_2(x)) - \alpha/2]/\sqrt{\alpha}$$

has a standard normal distribution. Similarly, 

$$[\log(f_1(x)/f_2(x)) + \alpha/2]/\sqrt{\alpha}$$

has a standard normal distribution when $x$ comes from category 2. The 
misclassification probabilities can be defined in terms of a standard 
normal random variable, 

$$a_{21} = P\left(Z < \frac{\log K - \alpha/2}{\sqrt{\alpha}}\right), \tag{16}$$

$$a_{12} = P\left(Z > \frac{\log K + \alpha/2}{\sqrt{\alpha}}\right), \tag{17}$$

where $Z$ has a standard normal distribution. Using (14), (16), and (17), 
the values of $a_{12}$, $a_{21}$, and $K$ may be calculated iteratively: 

(1) Let $K = 1$ and calculate $a_{12}$ and $a_{21}$ using (16) and (17); 

(2) Obtain a new value of $K$ by substituting $a_{12}$ and $a_{21}$ into (14); and 

(3) Repeat the entire process, starting from the new value of $K$. 
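
The three steps can be sketched as a short fixed-point loop. In this sketch the threshold is computed directly from the two-category form of Theorem 1, $K = (e_{21} - e_{22})/(e_{12} - e_{11})$ with $E = 2BA'Q - C$, rather than from an expanded formula. The function names, the fixed number of steps, and the example matrices are assumptions, and nothing here establishes that the iteration converges for every $B$ and $C$, so the loop stops with an error if $K$ leaves the positive range.

```python
from math import log, sqrt
from statistics import NormalDist

def error_rates(K, alpha):
    """Normal-case error rates: a21 = P(classified 2 | from 1), a12 = P(classified 1 | from 2)."""
    Z = NormalDist()
    a21 = Z.cdf((log(K) - alpha / 2) / sqrt(alpha))
    a12 = 1.0 - Z.cdf((log(K) + alpha / 2) / sqrt(alpha))
    return a12, a21

def threshold(a12, a21, B, C, q):
    """K implied by Theorem 1 for two categories: (e21 - e22) / (e12 - e11)."""
    A = [[1.0 - a21, a12], [a21, 1.0 - a12]]      # columns sum to one
    E = [[2.0 * q[i] * sum(B[j][k] * A[i][k] for k in range(2)) - C[j][i]
          for i in range(2)] for j in range(2)]
    return (E[1][0] - E[1][1]) / (E[0][1] - E[0][0])

def iterate_K(alpha, B, C, q, steps=50):
    K = 1.0                                        # step (1)
    for _ in range(steps):
        a12, a21 = error_rates(K, alpha)           # steps (1) and (3)
        K = threshold(a12, a21, B, C, q)           # step (2)
        if K <= 0:
            raise ValueError("threshold left the positive range; iteration stopped")
    return K, a12, a21

# Assumed inputs: alpha = 4 and a small quadratic weight B relative to C.
B = [[0.01, 0.0], [0.0, 0.01]]
C = [[0.3, 0.0], [0.0, 0.7]]
K, a12, a21 = iterate_K(4.0, B, C, [1.0, 1.0])
print(K, a12, a21)
```

For the Bayes special case ($B = 0$, $C$ diagonal with the prior probabilities) the same loop returns $K$ equal to the ratio of the priors after a single pass, which gives a quick sanity check on any implementation.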

In summary, assume we are trying to minimize $\operatorname{tr}[Q\,E(\hat{p} - p)(\hat{p} - p)']$, 
where $Q$ is a known weighting matrix. Also assume we have 
only two categories and that if $x$ comes from category $i$, it has a 
multivariate normal distribution with mean vector $\mu_i$ and covariance 
matrix $\Sigma$. The parameter $\alpha$ may be calculated using (15). The $B$ and $C$ 
matrices should be calculated using (10) and (11). Equation (13) gives 
the decision rule for classification into one of the two categories, where 
the values of $K$, $a_{12}$, and $a_{21}$ are obtained in an iterative manner using 
(14), (16), and (17). 

To apply any of the preceding theory, some prior knowledge of the 
distribution of $p$ is required to estimate $p_0$ and $\nu$. If the parameters $\mu_1$, 
$\mu_2$, and $\Sigma$ are unknown, they may be estimated from a sample of data 
with the usual sample means and pooled covariance matrix. An estimator 
of the parameter $\alpha$ could then be calculated using (15) with the 
estimates of $\mu_1$, $\mu_2$, and $\Sigma$ substituted into (15). The algorithm discussed 
above could then be used to generate $A$ and $K$, where the estimator of 
$\alpha$ is used in (16) and (17) instead of $\alpha$. The properties of the procedure 
when estimators of the parameters are used require further study. 
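
As an illustration of the plug-in step, the sketch below computes the sample means, the pooled covariance matrix, and the resulting estimate of $\alpha$ for two-dimensional measurements. The function names and the tiny data set are hypothetical, and the explicit 2 by 2 inverse limits this sketch to two variables.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def pooled_cov(x1, x2):
    """Pooled sample covariance matrix of two samples (lists of equal-length rows)."""
    d = len(x1[0])
    S = [[0.0] * d for _ in range(d)]
    for rows in (x1, x2):
        mu = mean_vec(rows)
        for r in rows:
            for a in range(d):
                for b in range(d):
                    S[a][b] += (r[a] - mu[a]) * (r[b] - mu[b])
    dof = len(x1) + len(x2) - 2
    return [[S[a][b] / dof for b in range(d)] for a in range(d)]

def alpha_hat_2d(x1, x2):
    """Plug-in estimate of alpha in (15) for two-dimensional x."""
    m1, m2 = mean_vec(x1), mean_vec(x2)
    (s11, s12), (s21, s22) = pooled_cov(x1, x2)
    det = s11 * s22 - s12 * s21
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    # delta' S^{-1} delta, using the explicit inverse of a 2 x 2 matrix
    return (s22 * d0 * d0 - (s12 + s21) * d0 * d1 + s11 * d1 * d1) / det

# Assumed training samples from categories 1 and 2.
x1 = [[1.0, 0.0], [3.0, 0.0], [2.0, 1.0], [2.0, -1.0]]
x2 = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
alpha_est = alpha_hat_2d(x1, x2)
print(alpha_est)  # approximately 6.0
```

In this example the pooled covariance is two-thirds of the identity and the means differ by (2, 0), so the estimate is 4 divided by 2/3.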

V. SUMMARY 

This article presents a model that incorporates the effects of misclassification 
error. Several guidelines are presented which can be used 
to determine if a given classifier is adequate. For the case when the 
classifier is yet to be designed, we have given an optimal classification 
algorithm. We discuss the two-category case in detail. 

APPENDIX 

Proof of Theorem 1 

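Let $R_i^0$ be the regions that result if the theorem is applied. Let $A_0$ be 
the resulting misclassification matrix. Let $R_i^1$ and $A_1$ be the corresponding 
elements for some other decision rule. Now consider (9) evaluated 
at $A_1$ minus (9) evaluated at $A_0$: 

$$\operatorname{tr}(QA_1BA_1' - QA_0BA_0' - A_1C + A_0C) = \operatorname{tr}[(A_1 - A_0)B(A_1 - A_0)'Q] + 2\operatorname{tr}[(A_1 - A_0)BA_0'Q] - \operatorname{tr}[(A_1 - A_0)C]. \tag{18}$$

The first term on the right of (18) is nonnegative since $B$ and $Q$ are 
nonnegative definite. We still have to show that the rest of (18) is 
nonnegative. Consider 

$$2\operatorname{tr}[(A_1 - A_0)BA_0'Q] - \operatorname{tr}[(A_1 - A_0)C] = \operatorname{tr}(A_1E) - \operatorname{tr}(A_0E), \tag{19}$$

where $E = 2BA_0'Q - C$. Let $e_{ij}$ be the $i,j$th element of $E$. Equation 
(19) now becomes 

$$\operatorname{tr}(A_1E) - \operatorname{tr}(A_0E) = \sum_i \sum_j \int \phi_1(i \mid x) f_j(x)\,dx \cdot e_{ji} - \sum_k \sum_{\ell} \int \phi_0(k \mid x) f_{\ell}(x)\,dx \cdot e_{\ell k}$$

$$= \sum_i \sum_k \int \phi_1(i \mid x)\,\phi_0(k \mid x) \left[\sum_j f_j(x) e_{ji} - \sum_{\ell} f_{\ell}(x) e_{\ell k}\right] dx, \tag{20}$$

where 

$$\phi_0(k \mid x) = \begin{cases} 1, & x \in R_k^0, \\ 0, & x \notin R_k^0, \end{cases} \qquad \phi_1(i \mid x) = \begin{cases} 1, & x \in R_i^1, \\ 0, & x \notin R_i^1. \end{cases}$$

Since $\phi_0(k \mid x)$ will be zero whenever $[\sum_j f_j(x) e_{ji} - \sum_{\ell} f_{\ell}(x) e_{\ell k}]$ is negative, 
(20) is always nonnegative. Q.E.D. 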